@gabotechs
Collaborator

Adds a basic example for running distributed queries on top of a couple of Parquet files, the same ones we use for tests.

@gabotechs gabotechs force-pushed the gabrielmusat/add-examples branch from d706ee3 to 2d39a7c Compare August 25, 2025 12:17
@gabotechs gabotechs force-pushed the gabrielmusat/refactor-arrow-flight-read branch from 2bdadca to d0fad3a Compare August 25, 2025 19:11
Collaborator

@NGA-TRAN left a comment

Super cool. I have tested them on the branch. Thanks, Gabriel.


```shell
cargo run --example localhost_run -- 'SELECT count(*), "MinTemp" FROM weather GROUP BY "MinTemp"' --cluster-ports 8080,8081 --explain
```
Collaborator

These commands are so cool. Do you think we are ready, as near-future work, to start supporting the distributed-datafusion-cli defined in #4?

Maybe we could add a new distributed-datafusion-cli folder, similar to datafusion-cli, to support this?

Collaborator Author

🤔 I'm not sure what value distributed-datafusion-cli would bring on top of the normal datafusion-cli. As this is just a library for distributing queries, the concept of a CLI becomes less relevant in this context.

If people want to use a CLI anyway, hopefully we can just reuse the normal datafusion-cli rather than building our own thing.

Collaborator

Reusing datafusion-cli is a good option as long as we provide a sensible default (e.g., 3 workers) and an easy way to customize the distributed settings.

cluster_ports: Vec<u16>,

#[structopt(long)]
explain: bool,
Collaborator

Nice
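
For context, the `cluster_ports: Vec<u16>` field in the hunk above receives the comma-separated value passed as `--cluster-ports 8080,8081`. The structopt attribute on that field is not visible in this hunk, so the following is only a standard-library sketch of that kind of parsing, not the example's actual code; `parse_cluster_ports` is a hypothetical helper name.

```rust
use std::num::ParseIntError;

/// Hypothetical helper (not part of the PR): splits a comma-separated
/// list such as "8080,8081" into port numbers.
fn parse_cluster_ports(raw: &str) -> Result<Vec<u16>, ParseIntError> {
    raw.split(',')
        .map(str::trim)
        .filter(|part| !part.is_empty()) // tolerate a trailing comma
        .map(|part| part.parse::<u16>()) // fails on non-numeric or out-of-range values
        .collect()
}

fn main() {
    let ports = parse_cluster_ports("8080,8081").expect("valid port list");
    println!("{ports:?}");
}
```

Collecting into `Result<Vec<u16>, _>` stops at the first invalid entry, so a typo in one port is reported instead of being silently dropped.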

The head stage will be executed locally in the same process as that `cargo run` command, but further stages will be
delegated to the workers running on ports 8080 and 8081.
Collaborator

Can you define "head stage"? This makes me wonder which part of the plan runs in the head stage.

Collaborator Author

If you visualize a plan as a tree of stages, the "head" or "root" stage is the top-level one, the first one counting from top to bottom. Added a clarifying comment.
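
To make that definition concrete, here is a minimal sketch of a plan as a tree of stages, where the head (root) stage stays in the local process and every stage below it is handed to a worker port. The `Stage` and `Placement` types and the `place_stages` function are hypothetical and are not the library's API. The round-robin assignment is also an assumption made for illustration; the README only says that further stages are delegated to the workers.

```rust
use std::collections::VecDeque;

/// Hypothetical model of a staged plan (illustrative, not the library's types).
#[derive(Debug)]
struct Stage {
    name: String,
    inputs: Vec<Stage>,
}

/// Where a stage would run in this sketch.
#[derive(Debug, PartialEq)]
enum Placement {
    Local,
    Worker(u16),
}

/// Small constructor to keep the example readable.
fn stage(name: &str, inputs: Vec<Stage>) -> Stage {
    Stage { name: name.to_string(), inputs }
}

/// Keeps the head stage local and assigns every stage below it to a worker
/// port, walking the tree top to bottom. Round-robin is an assumption.
fn place_stages(head: &Stage, worker_ports: &[u16]) -> Vec<(String, Placement)> {
    let mut placements = vec![(head.name.clone(), Placement::Local)];
    let mut queue: VecDeque<&Stage> = head.inputs.iter().collect();
    let mut next_worker = 0usize;
    while let Some(current) = queue.pop_front() {
        // With no workers configured, fall back to running locally.
        let slot = next_worker % worker_ports.len().max(1);
        let placement = match worker_ports.get(slot) {
            Some(port) => Placement::Worker(*port),
            None => Placement::Local,
        };
        placements.push((current.name.clone(), placement));
        next_worker += 1;
        queue.extend(current.inputs.iter());
    }
    placements
}

fn main() {
    // head -> stage-a -> stage-b, read from top to bottom.
    let plan = stage("head", vec![stage("stage-a", vec![stage("stage-b", vec![])])]);
    for (name, placement) in place_stages(&plan, &[8080, 8081]) {
        println!("{name}: {placement:?}");
    }
}
```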

@@ -0,0 +1,53 @@
# Localhost workers example
Collaborator

I checked out your branch, ran the commands below, and got the query result and the explain output back. Super cool!

Base automatically changed from gabrielmusat/refactor-arrow-flight-read to main September 2, 2025 08:31
@gabotechs gabotechs force-pushed the gabrielmusat/add-examples branch from 2d39a7c to a2d5822 Compare September 2, 2025 08:39
@gabotechs gabotechs merged commit 39697ab into main Sep 2, 2025
3 checks passed
@gabotechs gabotechs deleted the gabrielmusat/add-examples branch September 2, 2025 09:01